I had used R a little previously, so getting started was easy. I managed to install all required packages and knit Exercise Set 1.
Dealing with Github has been quite annoying because the website has been really slow to update after pushing.
R for Health Data Science is helpful for getting started, but on some topics I would prefer more information and examples. I found Exercise Set 1 a useful companion to the book.
I created a “data” folder and added the R script “create_learning2014.R” to it. The script reads the learning2014 data from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt, and creates an analysis dataset which is saved in the data folder as learning2014.csv.
First we read in the learning2014 data. In the file, the header is given in the first row, and the data separator is “,”.
# read data
students2014 <- read.table("data/learning2014.csv", header = T, sep = ",")
# print dimensions of the table
dim(students2014)
## [1] 166 7
# print first 6 rows
head(students2014)
## gender age attitude points deep surf stra
## 1 F 53 37 25 3.583333 2.583333 3.375
## 2 M 55 31 12 2.916667 3.166667 2.750
## 3 F 49 25 24 3.500000 2.250000 3.625
## 4 M 53 35 10 3.500000 2.250000 3.125
## 5 M 49 37 22 3.666667 2.833333 3.625
## 6 F 38 38 21 4.750000 2.416667 3.625
The dataset has 7 variables and 166 observations (rows). The variables are:
## [1] "gender" "age" "attitude" "points" "deep" "surf" "stra"
Gender is M (Male) or F (Female), and age is given in years. Attitude describes the student’s overall attitude toward statistics. Points are the students’ exam results. deep, surf, and stra are averages taken over the students’ answers to questions on deep, surface and strategic learning.
Next, we can plot a graphical overview of the data using the GGally and ggplot2 libraries. The variables are plotted separately by gender.
# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(students2014[-1], mapping = aes(col = students2014$gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
The plots on the diagonal show the density distributions of each variable (red for female, blue for male). Scatter plots below the diagonal show how each pair of variables is related. Above the diagonal are the total and gender-specific correlations between each pair of variables. The number of stars ‘*’ next to each value indicates how statistically significant the correlation is. The largest total correlation is between attitude and points. Interestingly, for males there is a clear negative correlation between deep and surf, while for females these two are almost uncorrelated.
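These gender-specific correlations can also be checked numerically; a small sketch using the columns already in students2014:

```r
# correlation between deep and surf, computed separately for each gender
by(students2014[, c("deep", "surf")], students2014$gender,
   function(d) cor(d$deep, d$surf))
```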
We see that attitude, stra, and surf have the largest correlations with points. Thus, we choose them as explanatory variables and fit a regression model with exam points as the target variable.
# create a regression model with multiple explanatory variables
m <- lm(points ~ attitude + stra + surf, data = students2014)
# print out a summary of the model
summary(m)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Looking at the summary, surf does not have a statistically significant relationship with the target variable (i.e. its p-value is too large). We can remove it and refit the model.
m2 <- lm(points ~ attitude + stra, data = students2014)
summary(m2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
P-values for the remaining variables are now smaller, and the model’s explanatory power is better. There is a strong positive correlation between attitude and the target variable, but stra (strategic learning) also has a small effect. Multiple R-squared tells how well the variables explain the variation of the target. In this case only 20.48 % of the variation in exam points is explained by attitude and stra, so the explanatory power of the model is quite low.
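As a sanity check, the Multiple R-squared can be computed by hand as 1 − RSS/TSS:

```r
# R-squared = 1 - residual sum of squares / total sum of squares
y <- students2014$points
1 - sum(residuals(m2)^2) / sum((y - mean(y))^2)  # should match summary(m2)$r.squared
```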
We can use the model to produce the following diagnostic plots: Residuals vs Fitted values, Normal Q-Q plot and Residuals vs Leverage.
# Plot residuals vs fitted values
plot(m2, which = c(1))
The model assumes a linear relationship between the explanatory variables and the target. If the “Residuals vs Fitted” plot were curved, we might need to include a quadratic term in the model. Based on the plot, a linear model is sufficient. Points 56, 35, and 145 are the biggest outliers.
# Plot Q-Q residuals
plot(m2, which = c(2))
The “Q-Q Residuals” plot follows the straight line fairly closely apart from a few outliers. This shows that the prediction errors are close to normally distributed.
# Plot residuals vs Leverage
plot(m2, which = c(5))
Leverage is a measure of how much influence each point has on the model prediction. In “Residuals vs Leverage”, we see that all leverage values are small, but some points have a proportionally larger effect.
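If we want the actual leverage values rather than just the plot, hatvalues() returns them for each observation; a quick sketch:

```r
# three observations with the highest leverage on the fitted model
head(sort(hatvalues(m2), decreasing = TRUE), 3)
```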
I downloaded the data from http://www.archive.ics.uci.edu/dataset/320/student+performance, and added the R script create_alc.R to the folder. The script merges two datasets of student alcohol consumption and saves the result in the data folder as alc_use.csv. We only keep students present in both datasets. There are 370 students and 35 variables in the joined data.
These data describe student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. An important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued in the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper source for more details).
Firstly, we can read and glimpse at the data from alc_use.csv:
# access the tidyverse libraries tidyr, dplyr, ggplot2
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# read data
alc <- read.table("data/alc_use.csv", header = T, sep = ",")
# glimpse at the data
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
The names of variables in the dataset are:
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Two new columns have been added to the joined data: alc_use is the average of the answers related to weekday and weekend alcohol consumption, and high_use is TRUE for students whose alc_use is greater than 2 (and FALSE otherwise).
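A sketch of how these two columns were likely created in create_alc.R, assuming the standard Dalc (weekday) and Walc (weekend) definitions from the glimpse above:

```r
# average of weekday (Dalc) and weekend (Walc) consumption, and a logical high-use flag
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
alc <- mutate(alc, high_use = alc_use > 2)
```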
The purpose of this analysis is to study the relationships between
high/low alcohol consumption and some of the other variables in the
data. For example, we can create box plots of the final grades
(G3) in terms of high use:
# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("grade")
Grades of male high alcohol users are lower on average, but this is not true for females. Because of the high variance in both cases, the grade itself is not a good predictor of alcohol use.
Here, I chose the following 4 variables, which appear to correlate most strongly with alcohol use: sex, absences, failures, and goout (frequency of going out with friends). The number of absences can be examined with a box plot:
# initialize a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 + geom_boxplot() + ylab("absences")
High alcohol use correlates positively with the number of absences. Similar results can be seen with the number of class failures and the frequency of going out with friends. We can examine the effect of goout and failures by creating separate bar plots for low and high use.
# initialize a bar plot of high_use and goout
g1 <- ggplot(alc, aes(x = goout, fill=sex)) + geom_bar()
g1 + facet_wrap("high_use")
g2 <- ggplot(alc, aes(x = failures, fill=sex)) + geom_bar()
g2 + facet_wrap("high_use")
For failures the correlation is not as noticeable, so we can also use group_by() to calculate the average number of class failures and absences (grouped by sex).
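The summary table below was produced along these lines (column names taken from the printed output):

```r
# mean failures and absences grouped by sex and high_use
alc %>% group_by(sex, high_use) %>%
  summarise(count = n(), fail = mean(failures), absent = mean(absences))
```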
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 5
## # Groups: sex [2]
## sex high_use count fail absent
## <chr> <lgl> <int> <dbl> <dbl>
## 1 F FALSE 154 0.104 4.25
## 2 F TRUE 41 0.293 6.85
## 3 M FALSE 105 0.143 2.91
## 4 M TRUE 70 0.386 6.1
Next, we use logistic regression to statistically explore the relationship between the chosen variables and the binary high/low alcohol consumption variable as the target variable.
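The fitted model matches the Call shown in the output below:

```r
# logistic regression: binary high_use explained by the four chosen variables
m <- glm(high_use ~ sex + absences + failures + goout, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
```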
##
## Call:
## glm(formula = high_use ~ sex + absences + failures + goout, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.14389 0.47881 -8.654 < 2e-16 ***
## sexM 0.97989 0.26156 3.746 0.000179 ***
## absences 0.08246 0.02266 3.639 0.000274 ***
## failures 0.48932 0.23073 2.121 0.033938 *
## goout 0.69801 0.12093 5.772 7.83e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 369.50 on 365 degrees of freedom
## AIC: 379.5
##
## Number of Fisher Scoring iterations: 4
The variable goout has the lowest p-value, meaning it has the most explanatory power in the model. sex and absences also have high significance, while failures has fairly low significance (a high p-value).
The computational target variable in the logistic regression model is the log of odds: \[\log\left( \frac{p}{1 - p} \right).\] Therefore we apply the exponential function to the coefficients to obtain the modeled odds ratios:
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m)
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01586098 -5.12422446 -3.2428932
## sexM 2.66415000 0.47316089 1.5009085
## absences 1.08595465 0.03922013 0.1293995
## failures 1.63121360 0.04208372 0.9518235
## goout 2.00975350 0.46657743 0.9418090
Note that the confidence intervals printed above are still on the log-odds scale; applying exp() to them as well would put them on the odds-ratio scale. Next, we can add prediction probabilities and predicted values (TRUE or FALSE) to the table, and perform a 2x2 cross tabulation of predictions versus the actual values.
# add prediction probabilities and predicted values (TRUE or FALSE)
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 243 16
## TRUE 61 50
In the table, inaccurately classified individuals appear in the off-diagonal cells.
# table of proportional values
t <- table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
t
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65675676 0.04324324 0.70000000
## TRUE 0.16486486 0.13513514 0.30000000
## Sum 0.82162162 0.17837838 1.00000000
# total fail rate
(t[1,2] + t[2,1])
## [1] 0.2081081
The training error (i.e. the percentage of wrong predictions) is 20.8%. The prediction accuracy is not too bad.
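For comparison, a trivial classifier that always predicts FALSE would fail exactly as often as the share of high users in the data:

```r
# error rate of always guessing FALSE = proportion of high_use == TRUE
mean(alc$high_use)  # 0.30, i.e. a 30% error rate
```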
However, we can simplify the model by dropping both
failures and absences:
m <- glm(high_use ~ sex + goout, data = alc, family = "binomial")
# Summarise the model
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + goout, family = "binomial", data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8193 0.4572 -8.354 < 2e-16 ***
## sexM 0.9268 0.2507 3.696 0.000219 ***
## goout 0.7578 0.1190 6.369 1.9e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 388.94 on 367 degrees of freedom
## AIC: 394.94
##
## Number of Fisher Scoring iterations: 4
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
# table of proportional values
t <- table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
# total fail rate
(t[1,2] + t[2,1])
## [1] 0.2135135
The failure rate is very similar at about 21.4%, so clearly not all of the variables are needed to get good predictions.
Finally, let’s perform 10-fold cross-validation on the original model:
# loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2432432
The error rate varies between runs of the cross-validation (here 24.3%), but on average it is around 21%. This model clearly outperforms the one introduced in Exercise Set 3.
First we load the Boston data from the MASS package.
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
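The loading step itself, implied by the package message above, is simply:

```r
# load the MASS package and attach the Boston data
library(MASS)
data("Boston")
```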
The Boston dataset consists of information collected by the U.S. Census Service concerning housing in the Boston suburbs, such as the per capita crime rate (crim). The data frame has 506 rows and 14 columns.
# print variable names and dimensions of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
We can visualize the relationships between variables by calculating and plotting a correlation matrix:
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
round(cor_matrix, digits = 2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")
For example, the crime rate correlates most strongly with accessibility to radial highways (rad) and the property-tax rate (tax).
Summarising the data variables:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
For each variable, summary prints out the mean, median, minimum and maximum values, and the 1st and 3rd quartiles.
Next we want to standardize the dataset so that the mean is 0 and the standard deviation is 1 for each variable.
# center and standardize variables
boston_scaled <- as.data.frame(scale(Boston))
# make crime rate variable numeric
boston_scaled$crim <- as.numeric(boston_scaled$crim)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We can create a categorical variable from the crime rate by cutting the scaled crime rate at its quartiles.
# divide data to bins
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# print table of the new variable
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# drop the old crime rate variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Finally, we split the dataset into train and test sets, with 80% of the data in the train set:
# choose randomly 80% of the rows
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
# create train and test set
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
We use linear discriminant analysis on the train set with crime rate as the target variable and all the other variables as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2648515 0.2277228 0.2574257
##
## Group means:
## zn indus chas nox rm age
## low 1.0238605 -0.9793461 -0.11640431 -0.9088110 0.43724680 -0.9170710
## med_low -0.1097611 -0.2709396 -0.05155709 -0.5520220 -0.21131780 -0.3077811
## med_high -0.3837758 0.2204956 0.07002747 0.4516621 0.08314622 0.3964594
## high -0.4872402 1.0170690 -0.04518867 1.0572915 -0.41341281 0.8051324
## dis rad tax ptratio black lstat
## low 0.9723729 -0.6725811 -0.7196436 -0.41987151 0.37766136 -0.789035129
## med_low 0.3382136 -0.5546844 -0.5055807 -0.09083605 0.34788771 -0.074058686
## med_high -0.3986553 -0.4126310 -0.2766658 -0.22899005 0.09008068 0.002386964
## high -0.8561968 1.6386213 1.5144083 0.78135074 -0.70489361 0.910984773
## medv
## low 0.51187718
## med_low -0.03374719
## med_high 0.16212785
## high -0.70226104
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07386515 0.58119845 -0.80823690
## indus 0.05826081 -0.42553489 0.49823325
## chas -0.11224342 0.08706484 0.13999523
## nox 0.41784425 -0.81308684 -1.48982358
## rm -0.10650187 -0.05230303 -0.22951413
## age 0.18738135 -0.19338831 -0.03984347
## dis -0.04819067 -0.14319286 0.06165281
## rad 3.13232869 0.96769635 0.36593275
## tax -0.02036994 0.16228997 0.01855767
## ptratio 0.12982447 -0.07768621 -0.36870098
## black -0.14690853 0.05621491 0.22076989
## lstat 0.20684628 -0.16516085 0.31597687
## medv 0.19459127 -0.37616148 -0.30356069
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9421 0.0418 0.0161
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
We can test the accuracy of the LDA model with predict
and cross tabulate the results.
# separate the correct classes from test data
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 8 4 0
## med_low 6 11 2 0
## med_high 0 15 18 1
## high 0 0 0 23
66 / 102 ≈ 64.7% of classes were correctly predicted. High and low crime rate predictions are mostly correct, while the prediction accuracy for the middle categories is quite low.
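The prediction accuracy can also be computed directly from the cross-tabulation:

```r
# proportion of correctly predicted classes (diagonal of the confusion matrix)
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)  # 66/102 for the table above
```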
Similarity between observations can be measured by calculating the Euclidean distances between them:
# euclidean distance matrix
dist_eu <- dist(Boston)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
K-means clustering divides the dataset into k clusters based on their distances.
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The optimal number of clusters is found where the total within-cluster sum of squares (WCSS) drops radically. In this case the optimal k is 2. Plotting the clusters for part of the dataset, separated by color:
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot part of the dataset with clusters
pairs(Boston[c("rm", "age", "dis", "crim")], col = km$cluster)
K-means clustering with k=3:
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)
km <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[c("rm", "age", "dis", "crim")], col = km$cluster)
Perform LDA using the clusters as target classes:
train$crime <- as.numeric(train$crime)
km <- kmeans(train, centers = 3)
# linear discriminant analysis
lda.fit <- lda(km$cluster ~ ., data = train)
lda.fit
## Call:
## lda(km$cluster ~ ., data = train)
##
## Prior probabilities of groups:
## 1 2 3
## 0.3836634 0.2772277 0.3391089
##
## Group means:
## zn indus chas nox rm age dis
## 1 0.7431953 -0.9170888 -0.1961271 -0.9005622 0.4694583 -1.0050277 0.9499061
## 2 -0.4872402 1.0575658 -0.0262603 1.0019150 -0.4042927 0.7604060 -0.8346874
## 3 -0.4010165 0.1595450 0.1300023 0.2045813 -0.3013126 0.4763877 -0.3289978
## rad tax ptratio black lstat medv crime
## 1 -0.6017657 -0.7065261 -0.4145458 0.3648854 -0.7741460 0.5665283 1.458065
## 2 1.5775695 1.5244331 0.8086653 -0.6500602 0.8911452 -0.6951667 3.892857
## 3 -0.5711056 -0.4084738 -0.1332044 0.1941297 0.2009457 -0.1458682 2.518248
##
## Coefficients of linear discriminants:
## LD1 LD2
## zn 0.154085551 0.06103328
## indus 0.391385720 0.15527376
## chas 0.080419215 0.24912662
## nox -0.230489323 -0.18249444
## rm -0.061577202 -0.30648071
## age 0.243440665 0.90671299
## dis -0.272545600 -0.43557841
## rad 3.025058848 -1.84037354
## tax 1.202881792 -0.13433918
## ptratio 0.242578180 0.00851564
## black -0.004462324 0.04236079
## lstat 0.313414629 -0.20396002
## medv 0.072512784 -0.36100594
## crime -0.070030332 1.13766168
##
## Proportion of trace:
## LD1 LD2
## 0.9163 0.0837
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 1)
The most influential linear separators for the clusters are
tax and rad.
I added create_human.R to the data folder; it downloads the “Human development” and “Gender inequality” datasets, joins them and saves the result as human.csv.
Further data manipulation is done in create_human.R, and the dataset human.csv is saved to the data folder. There are now 155 observations and 9 variables. We excluded unneeded variables, and then filtered out rows with missing (NA) values or rows that relate to regions instead of countries.
We open the human data and move the country names to row
names:
# access libraries
library(tidyr); library(dplyr); library(ggplot2); library(readr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# read data from human.csv
human <- read_csv('data/human.csv', show_col_types = FALSE)
library(tibble)
human <- column_to_rownames(human, "Country")
We can summarise the data with ggplot and
corrplot:
library(GGally);library(corrplot)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## corrplot 0.92 loaded
# visualize the variables
ggpairs(human)
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()
We can deduce that life expectancy at birth (Life.Exp) has a
strong positive correlation with expected years of schooling
(Edu.Exp). On the other hand, the maternal mortality ratio
(Mat.Mor) and adolescent birth rate (Ado.Birth)
have strong negative correlations with Life.Exp and
Edu.Exp. Gross national income per capita (GNI)
shows similar, though weaker, correlations. The percentage of female
representatives in parliament (Parli.F) and relative labor
force participation (Labo.FM) correlate with each other but
only weakly with the other variables.
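These readings can also be checked programmatically by picking out the largest off-diagonal entry of the correlation matrix. A sketch using a few mtcars columns as a stand-in; with the real data you would pass cor(human) instead:

```r
# find the most strongly correlated variable pair
cm <- cor(mtcars[, c("mpg", "wt", "hp", "qsec")])
cm[upper.tri(cm, diag = TRUE)] <- NA           # keep each pair once
idx <- arrayInd(which.max(abs(cm)), dim(cm))   # position of the largest |r|
c(rownames(cm)[idx[1]], colnames(cm)[idx[2]], round(cm[idx], 2))
```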
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
s <- summary(pca_human)
# draw a biplot
biplot(pca_human, choices = 1:2, cex = c(1, 0.8), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
PC1 and PC2 are the two most significant axes found by PCA. The
percentage of variance each component captures is shown in brackets in
the axis labels. Pink arrows represent
the original variables. Unfortunately, the plot is useless, because the
GNI values are much larger than those of the other variables,
causing PCA to overvalue GNI's impact.
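The problem is easy to demonstrate: when one column's variance dwarfs the others, unscaled PCA loads almost everything onto that column. A sketch on mtcars, where disp plays the role of GNI:

```r
# disp has a variance orders of magnitude larger than mpg or wt
sapply(mtcars[, c("mpg", "disp", "wt")], var)

# unscaled PCA: PC1 captures essentially all of the variance, driven by disp
pca <- prcomp(mtcars[, c("mpg", "disp", "wt")])
summary(pca)$importance[2, 1]  # proportion of variance on PC1, close to 1
```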
We standardize the data and perform PCA again:
# Standardize data, and perform PCA
pca_human <- prcomp(scale(human))
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The arrow lengths are now comparable because the standardized variables
all have the same standard deviation. The labels are not very clear, but
Parli.F and Labo.FM dominate along the PC2 axis
while the other variables dominate along PC1, with the directions of the
arrows corresponding to positive and negative correlations.
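The dominance pattern can also be read numerically from the loadings (the rotation matrix) instead of the crowded arrows. A sketch with scaled mtcars columns standing in for the human data:

```r
# loadings show how much each original variable contributes to PC1 and PC2
pca <- prcomp(scale(mtcars[, c("mpg", "wt", "hp", "qsec")]))
round(pca$rotation[, 1:2], 2)
```

With the human data, applying the same call to pca_human$rotation shows Parli.F and Labo.FM carrying the largest weights on PC2.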
We can deduce, not very surprisingly, that people in more developed countries are better educated, have higher life expectancy, and have better healthcare outcomes (lower maternal mortality). Additionally, countries where women make up a larger share of the workforce also tend to have more female representatives in parliament, but this does not necessarily indicate higher development.
Next, we want to analyze the tea data from the FactoMineR package.
# read tea data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
There are 300 observations and 36 variables. We pick 6 of the variables for analysis.
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, any_of(keep_columns))
# a quick look at the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
# visualizing the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + geom_bar() + facet_wrap("name", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
We perform Multiple Correspondence Analysis on the dataset:
# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
As with PCA, we can draw a biplot of the results.
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
We can see some patterns in how people consume tea; for example, at a tea shop, tea is typically bought unpackaged. However, Dim 1 and Dim 2 cover a fairly small percentage of the variance, so the plot does not tell us much about the associations. There is some separation between tea types and additives, but not enough to draw conclusions.
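The point about Dim 1 and Dim 2 can be checked directly from the eigenvalue table in the MCA summary above (the same percentages are available programmatically as mca$eig):

```r
# percentages of variance per MCA dimension, copied from the summary above
pct <- c(15.238, 14.232, 11.964, 10.333, 9.667, 8.519, 7.841,
         7.705, 6.392, 4.724, 3.385)
sum(pct[1:2])  # Dim 1 + Dim 2 explain under 30% of the variance
```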